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Abstract. Massive spectroscopic surveys of high-redshift quasars yield large numbers of 
correlated Lya absorption spectra that can be used to measure large-scale structure. Simu- 
lations of these surveys are required to accurately interpret the measurements of correlations 
and correct for systematic errors. An efficient method to generate mock realizations of Lya 
forest surveys is presented which generates a field over the lines of sight to the survey sources 
only, instead of having to generate it over the entire three-dimensional volume of the survey. 
The method can be calibrated to reproduce the power spectrum and one-point distribution 
function of the transmitted flux fraction, as well as the redshift evolution of these quantities, 
and is easily used for modeling any survey systematic effects. We present an example of how 
these mock surveys are applied to predict the measurement errors in a survey with similar 
parameters as the BOSS quasar survey in SDSS-III. 
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1 Introduction 

The hydrogen Lya absorption spectra of high-redshift sources are being revealed as an ex- 
tremely powerful tool for the study of large-scale structure in observational cosmology. The 
numerous absorption features observed in the spectra of quasars usually described as the 
" Lya forest " were originally interpreted as discrete gas clouds, but have been better un- 
derstood and described as arising from the continuous cosmic web of filamentary structures 
that is expected in the Cold Dark Matter model of structure formation. Results from hy- 
drodynamic cosmological simulations have shown that the observed properties of the Lya 
forest are generally in good agreement with the hypothesis of a photoionized intergalactic 
medium with density fluctuations that are related to the same primordial perturbations that 
give rise to the galaxy distribution and the Cosmic Microwave Background fluctuations (e.g., 
[1, 2]). The Lya forest spectra should therefore be considered as a continuous field of the 
Lya transmitted fraction F(x) (where x is the redshift-space coordinate), which is related 
to the variations of the gas density, peculiar velocity and temperature along the line of sight, 
and eventually to the primordial density field, particularly on large scales, in which the 
complexities of non-linear evolution become less important. 

In fact, if we have a large number of absorption spectra from different sources covering 
a large volume and with a sufficiently dense sampling, one can measure the redshift space 
power spectrum of the field F(x). In the limit of large scales, this power spectrum should be 
related to the linear power spectrum of density perturbations as (see [3-5]) 



where fik is the cosine of the angle of the wavevector k in Fourier space relative to the line 
of sight, and Pl is the linear power spectrum of the mass density perturbations. This is the 
same form of the linear power spectrum derived by [6] for any class of observed objects with 
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a bias factor bg, which relates the amplitude of observed fluctuations to the amplitude of the 
underlying mass fluctuations. But for the Lya forest, the redshift distortion parameter f3 
depends on a second bias factor that is related to the response of the mean value of F to a 
large-scale peculiar velocity gradient, and must be determined independently. 

Therefore, the promise of massive spectroscopic surveys of Lya absorption spectra is to 
help determine the shape of -Pl(&) over a wide range of scales and redshifts, and to use this 
to obtain crucial cosmological measurements, such as the angular and redshift scale of the 
Baryon Acoustic Oscillations, or the effect of neutrinos on the power spectrum (e.g., [7]). In 
addition, one can determine the values of b$ and (3 at each redshift, which are in principle 
predictable with hydrodynamic simulations from the small-scale physics that determine the 
properties of the Lya forest ([5]). A first step in this direction was recently accomplished by 
[8] from the first analysis of the quasar absorption spectra in the BOSS survey. 

Accurately measuring the power spectrum requires a careful evaluation and correction 
of any systematic errors that may be present in this measurement in the analysis of real 
data. The only way to reliably doing this is by generating several random realizations of 
the multiple Lya absorption spectra in a survey, and introducing into them any possible 
systematic effects to see how they may impact the inferred power spectrum in the end. Some 
of the systematic effects that need to be considered are the following: errors in the modeling 
of the quasar continuum C(A), which is needed to evaluate the transmitted fraction from 
the observed flux, /(A) = C(X)F(X); variable spectral resolution and noise; flux calibration 
errors; the impact of the redshift evolution of the Lya forest; the presence of damped Lya , 
Lyman limit systems and metal absorption lines in the spectra; or variations in the intensity 
of the cosmic ionizing background. Modeling these systematic effects as accurately and 
reliably as possible requires our ability to generate mock surveys of Lya absorption spectra 
in large numbers, for many different cases, and in a way that can be easily used. These mock 
surveys must include a large number of lines-of-sight over large volumes (like the ongoing 
BOSS survey in SDSS-III; [9]), and somehow include the small-scale fluctuations of the Lya 
forest that are present in the observed spectra of sources that are point-like for practical 
purposes. 

Generating these mock surveys directly from three-dimensional simulations, by selecting 
lines of sight from them, presents several difficult challenges. The first is that having a 
large enough volume to correctly simulate the power spectrum, at least up to scales as 
large as the BAO peak, implies that the resolution of the simulations cannot capture the 
smallest relevant scales for the Lya forest. In addition, when using large three-dimensional 
simulations, the computer resources that are required may not allow obtaining many mocks 
that are independent, or changing the parameters of these mocks in an efficient and fast way 
to enable a large number of tests. 

This paper presents a method to efficiently create these mock surveys of Lya absorption 
spectra, taking advantage of the fact that the transmitted fraction F needs to be generated 
only on the discrete lines of sight to the survey sources. The method consists of generating 
one-dimensional fields for each line of sight and introducing correlations among them as 
if they had been drawn from a three-dimensional field. The capacity that is lost with this 
method is using hydrodynamic simulations that include the non-linear gravitational evolution 
of density fluctuations and other physical effects to simulate the field F(x). However, if we 
care only about the large-scale power spectrum of this field and the errors to which it can be 
measured, it is in principle enough to ensure that the mocks have the same variance in the 
small-scale fluctuations to reproduce their effect on large scales. The way the mock surveys 
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are generated is by using an input power spectrum of F(pc) in redshift space that includes a 
non- linear correction for small scales, and which is assumed to be calibrated from the results 
of cosmological simulations with enough resolution or directly from the observational results. 
The mocks can also incorporate any desired one-point distribution of F and the redshift 
evolution of both the power spectrum and the distribution of F. 

Hence, the philosophy of these mock surveys is that they are generated from an input 
model of the power spectrum and other quantities, and that they should be used for predicting 
the large-scale correlation measurements of the Lya forest and the way they are affected by 
any systematic errors that can be introduced. However, the field F(x) that is simulated is 
purely local and inferred from the linear overdensity, so it does not reproduce the 3-point or 
higher n-point correlations of the Lya forest. 

The method is presented in detail in §2, and an application to an example of a survey 
similar to BOSS is presented in §3. Another application of these mocks to simulate the effect 
of damped Lya systems is discussed in [10]. This method was already used for simulating 
the sample of spectra used in [8], and is being improved for application to the final BOSS 
survey. 

A standard flat ACDM cosmology is used in this paper with the following parameters: 
h = 0.72 , Q m = 0.281, cr 8 = 0.85, n s = 0.963, tt b = 0.0462. 

2 Method to generate mocks of correlated Lya spectra 

A Lya forest spectrum is given by the fraction of transmitted flux, F = exp(— r), where 
r is the optical depth, at each observed wavelength. We define the comoving coordinate 
in redshift space, x, related to the wavelength by dx = c/H(z)(d\/\ a ), where H{z) is 
the Hubble constant, the redshift is 1 + z = X/X a and X a = 1216 A is the Lya resonance 
wavelength. The observed spectrum is the product of F(x) times the continuum of the source, 
which is not independently observed and must be modeled. We shall not deal in this paper 
with the issue of modeling the continuum. Our mocks are realizations of the function F{x) 
on multiple, correlated lines of sight. 

In this paper we shall generally work with the variable 



where F is the mean value of F at a given redshift. All the 2-point correlations appearing in 
this article are of this 5f variable unless otherwise stated. This section describes the method 
to generate a set of mock Lya spectra with any specified distribution function and power 
spectrum for the 5f variable. The main idea for the case of a Gaussian field is explained 
in 2.1, which is then generalized to any desired distribution of 5f (2.2). The inclusion of 
redshift evolution is discussed in 2.3. 

2.1 Generation of a Gaussian random field 

The most important requirement that our mock Lya spectra must meet if they are to ac- 
curately predict any systematic and statistical errors in the measurements of large-scale 
correlations in 5f is that they have a redshift space power spectrum of the flux that accu- 
rately matches the observed one. In this way, the intrinsic variance of the Lya absorption at 
any scale can be reproduced, and the way it affects the sampling errors on all other scales is 
correctly taken into account. Our method to generate mock Lya spectra can take as input 
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any desired power spectrum PF(k\\,k±) in redshift space, where fen, k± are the components 
of the wave vector in Fourier space parallel and perpendicular to the direction of the line of 
sight. 

2.1.1 Sampling the volume unevenly 

The usual way to generate a Gaussian random field in realizations of cosmological pertur- 
bations is to generate first a set of independent Fourier modes in a three-dimensional cubic 
box with a specified power spectrum, and then doing the Fourier transform to obtain the 
real-space field. This method yields the value of the field at all the cells in the cubic volume 
at once. 

However, to simulate the measurement of correlations up to the BAO scale in a survey of 
quasar spectra, we need to cover a volume with a size of at least several times the BAO scale, 
with a required resolution needed to capture the fluctuations in the low-density intergalactic 
medium of at least \j/(2ir) = a/3/2 c s t, or ~ 100 comoving kpc (where Aj is the Jeans 
length, c s is the sound speed of the intergalactic gas, and t the age of the universe; well- 
resolved simulations of the IGM typically use cells a factor of a few smaller than the Jeans 
length; see, e.g., [11, 12]). The minimum dynamic range from the smallest to the largest 
scale is then ~ 10 4 , or 10 12 simulated points (and even larger if the entire volume of a survey 
like BOSS is to be generated), which results in a serious computational problem for being 
able to easily generate large numbers of mocks in a simple way. 

Our method uses the fact that we are only interested in the values of the field along a 
number of infinitely thin lines of sight traced by the quasar light. Hence, we can generate a 
Gaussian field on these one-dimensional lines only, and introduce correlations among them 
directly in real space. A first, simple-minded way to achieve this might be to first generate 
an independent Gaussian variable at each pixel, gi, and then combine them to generate the 
final field 5 g j = Lijgi which has the desired correlation dy. 

Cij = < 5gi5 g j > = < L ik g k Ljigi > = L ik Lji5 k i = L ik Lj k . (2.2) 

A particularly efficient way to obtain the required matrix L for the transformation is the 
result of the Cholesky decomposition of the covariance matrix C, i.e., a lower triangular 
matrix L obeying C = LL T . Numerically, there are several algebraic packages that perform 
the Cholesky decomposition very efficiently. 

For a practical application, the number of pixels that are needed to model a typical 
observed spectrum and to include the power down to the smallest relevant scales is N p ~ 10 3 
for each line of sight. For a survey with N q quasars, the total number of elements of the 
correlation matrix C that need to be computed is (N p x N q ) 2 . Clearly, this method would 
break down for a relatively small number of quasars. Fortunately, there is a better way to 
do it. 

2.1.2 Parallel lines of sight 

Let us assume for the moment that the lines of sight in the survey are perfectly parallel. Let 
<5 9 (x||,x^) be the correlated Gaussian variable we want to generate at the position xy of the 
line of sight at coordinate xj_. We can do the one-dimensional Fourier transform of 5 g on the 
direction of the line of sight only, to obtain 5 g (k\\ , xj_). These one-dimensional Fourier modes 
have the following correlation: 

(4 (fc|| , x ± ) S g (*j , x ± ') ) = f dk ± exp(ik ± x ± ) j dk ± > exp(ik ± 'x ± ') (2.3) 
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where the symbol 5 stands for the Dirac delta function, -P(k) is the power spectrum of 5 g , 
and 



The crucial property is that the one-dimensional modes 6 g on different lines of sight are 
independent except when fey = ki . Therefore, the problem is now separated for each value of 
fen, and the Cholesky decomposition operation needs to be performed on N p matrices of size 
N q x N q only. 

Hence, the procedure to be followed in our method is as follows. We first choose a 
grid of values of fey for the Fourier transforms on the line of sight. For each value of fey, we 
compute the correlation of the one-dimensional Fourier modes for every pair of lines of sight, 
using equations (2.3) and (2.4). Each one of these N q x N q matrices, C\- = P x (fey , r±), is then 
Cholesky-decomposed to obtain a matrix L^. After generating a set of independent Gaussian 
variables for each quasar and each value of fey, gk q , we compute the new set 5 g = Lkg, and 
we then do the inverse one-dimensional Fourier transform of these to finally obtain the S g 
variables, with all the real space correlations that are implied by the input 3-d power spectrum 



In reality, the Lya spectra need to be generated for quasars that are at different redshifts. 
We do this by first generating the spectra lines of sight of a long enough comoving length 
L, evaluating 5 g on bins of comoving width Ax. We set the center of the line of sight at 
a central redshift z c (we use z c = 2.6 in this paper), and every bin is then mapped into a 
redshift according to its comoving coordinate. We then use only the part of the spectrum 
of each quasar that is in the restframe wavelength range for Lya forest analyses. We use 
1041 A< A r < 1185 A in this paper, the usual range to avoid Ly/3 contamination and the 
proximity effect zone near the quasar. We also use L = 4096 /i" 1 Mpc, long enough to make 
any periodicity effects negligible, and Ax = 0.5/i -1 Mpc, slightly smaller than the typical 
pixel width in the BOSS spectrograph (1 A ~ 0.7 h~ l Mpc at the redshifts of interest). 

2.2 Flux distribution 

The principal goal of the mocks of correlated Lya forest spectra we want to generate is 
to simulate the observed spectra in a survey like BOSS that includes all of the statistical 
and systematic errors we may consider to obtain a correction for them when computing any 
statistical property. It is therefore important that the perturbation in the transmitted flux 
fraction, <5p, in the mock spectra has the same distribution as the observed one, in order that 
the impact of continuum fitting and noise on the measured correlations and their errorbars 
are correctly simulated. Note that the value of the noise that is added in the mocks and 
the way that the continuum fitting is obtained will depend on a complex way on the values 
of 5f- Here we generalize our method to generate a field Sf with the desired probability 
distribution function pf($f) and any power spectrum Pp(k). Although the higher order n- 
point correlations of F will obviously still be different for the mocks and the real Lya forest 
spectra, we expect this to have negligible impact on the computed errors of any statistical 
measurements on large scales (e.g., [2] found that errors computed as if the flux field was 
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Gaussian were close to errors determined by bootstrapping over real spectra, even on Mpc 
scales). 

This generalized method consists of generating first our field S g with a Gaussian dis- 
tribution, Pg(Sg) = ex.p(—5g/2)/\/2ir, with a different power spectrum P g such that, after 
transforming the field to the new variable the desired probability distribution func- 

tion pf(&f) and power spectrum Pp are obtained. The required transformation Sf(S 9 ) is 
obtained by integration of the equation 

dS g pf{of) 

Let us consider the correlation functions £_f(?"12) an d Cg( r i2) of the field values at two 
points xi and X2 separated by the distance r±2- We designate these field values as Sfi, S F 2, 
&gi, $g2- Since the field 5 g is strictly Gaussian, the correlation functions are related by 

£ F (r Vi )=(6 F1 6 F2 ) (2.6) 

-l/F-l fl/F-l 



d5 F \ J (16f2P2f(Sfi,Sf2)SfiSf2 

/oo 
dSg2 P2g(Sgl,dg 2 ) 6 F \6 F2 
-OO 



OO J — oo 

^1 + ^2-2^1^^12)' 



cxp 

oo /-oo 



2(l-e 9 2 (n 2 )) 

ddgl / d6g2 , F (Ogl) 0F(Sg2) ■ 



oo ^-00 27T,/l -B(ri 2 ) 



Note that we have assumed that the Gaussian field has unit variance, without loss of 
generality as an overall normalization factor can always be included in the definition of the 
mapping <5p(£ 9 ). This relation between the two correlations £p and £ g is actually a one- 
dimensional function that is totally independent of the separation r\ 2 or any other variable: 
it depends only on the relation <5p(5 g ). We can therefore tabulate and invert the function 

The procedure to generate a random field Sf is therefore the following: we start with 
an input model for the three-dimensional power spectrum Pp of the flux transmission, and 
compute the Fourier transform to obtain £p. We then convert this to the correlation function 
£ g , and proceed to compute the correlations of one-dimensional power for the Gaussian field 
g in equation 2.4), which can be re-expressed as: 

/oo 
dr\\e ik \n£ g (r p r ± ) . (2.7) 
-oo 

We mention here that this procedure does not in general work for any distribution function 
Pf(5f), because sometimes the resulting power P g x(k\\,r± = 0) may be negative for some 
values of fen. An auto-power spectrum must be positive definite, because the variance of 
Fourier modes can never be negative (for r± ^ 0, P g x(k\\,r±) can be negative). Fortunately, 
this does not occur for the input model chosen here, but it may well occur with other 
distributions (see [13] for a discussion of the same problem in the context of non-Gaussian 
initial conditions). 
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2.3 Redshift evolution and non-parallel lines of sight 

The power spectrum of Sf is a function of redshift. The main evolution is in the amplitude 
of the power spectrum, but a more general evolution in the shape is likely to be present, 
particularly on small scales. To introduce the redshift evolution in our model, we generate the 
field 5f for several discrete values of the redshift, obtaining a set of realizations 5fi(x^,x±), 
where the subindex i labels the redshift. Each of these realizations is generated with the same 
amplitudes and complex phases of the Fourier modes 5 g , and varying only the amplitude of 
the power spectrum that is different due to the evolution with redshift. In other words, 
the realizations at different redshifts have all the same random elements and change only 
because of the variation in the power spectrum, so we can smoothly interpolate between 
them to obtain the generated field at any desired redshift. 

The effect of the variation of the angular diameter distance and Hubble constant with 
redshift, and the fact that the lines of sight are not parallel, is included in the same way as 
the redshift evolution. The power spectrum can be expressed in terms of fixed angular and 
redshift separations at the discrete values of the redshift at which the multiple fields dpi are 
generated. 

The final field 5f is obtained by linear interpolation of the multiple fields as the redshift 
varies along the lines of sight, introducing in this way the gradual evolution in the power 
spectrum amplitude, the Hubble constant and the angular diameter distance with redshift. 

In this paper, the redshift values at which the fields 5fi are generated are z = 1.96, 
2.44, 2.91, and 3.39. 

2.4 Input model for Lya forest mocks used in this paper 

The distribution and power spectrum of the transmitted flux fraction can be determined from 
observations and can also be computed in theory from hydrodynamic cosmological simula- 
tions of the intergalactic medium. As observational progress is made, mocks of Lya forest 
surveys can be adjusted to reproduce as accurately as possible the observational determina- 
tions of the distribution and power spectrum of 5f, which guarantees an accurate modeling 
of the measurement errors for any quantities. Here, we use the parameterized fitting formula 
introduced by [5] to fit the results of the power spectrum from several numerical simulations, 



where b$ is the density bias parameter at z = 2.25, f3 is the redshift distortion parameter, 
Mfc — k\\/k, PL,(k) is the linear matter power spectrum, and DF(k,Hk) is a non-linear term 
that approaches unity at small k. This form of Pp is the expected one at small k in linear 
theory, and provides a good fit to the observations reported in [8]. Note that we do not 
generate a density and a velocity field, but we directly generate the Lya forest absorption field 
instead, with the redshift distortions being directly introduced in the input power spectrum 
model of equation (2.8), with the free parameter j3 that measures the strength of the redshift 
distortion. 

We use the parameters given in the central model of [5], b = —0.1315 and f3 = 1.58 (the 
negative sign of b simply reflects the decrease of 5f with gas density, and does not affect any 
equations in this paper because it always appears as b 2 ). Only the amplitude of the power 
spectrum is assumed to evolve with redshift, following a power-law: 



P F (k, fi k ) = b 2 s (l + ^l) 2 P L {k)D F (k, fi k ) 





(2.9) 
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We use the value a = 3.8 in this paper, as suggested by the evolution of the one-dimensional 
P(k) measured in [2]. 

For the probability distribution, we use a log-normal model for the optical depth r, 



F = e 



— T 



exp (— ae 19 ) 



(2.10) 



where g is a Gaussian variable of unit dispersion, and a and 7 are two free parameters 
determining the mean transmission F and its variance. 

In the future, a new distribution for F that more accurately matches the observed one 
should be used for the mocks, but the log-normal approximation suffices for the purpose of 
this paper of demonstrating the applications of Lya forest mocks. 

We assume a mean transmitted fraction that approximately matches the observations 



The values of a and 7 at each redshift can be derived by requiring the mean value of F 
to match equation (2.11), and its dispersion to reproduce the value implied by the power 
spectrum Pp. The result for the parameters at the four redshifts we use are the following: 
a = 0.065 and 7 = 1.70 at z = 1.96; a = 0.141 and 7 = 1.53 at z = 2.44; a = 0.275 and 
7 = 1.38 at z = 2.91; and a = 0.487 and 7 = 1.24 at z = 3.39. 

3 Results 

This section presents the results for the characteristic errors in the measurement of the cor- 
relation function, as an example of a simulated Lya forest survey with similar characteristics 
as BOSS. 

3.1 Model for the quasar survey 

The first step to generate a mock Lya forest survey is to generate the quasar sample. We 
randomly distribute quasars (with no clustering) over a circular area A = 300 deg 2 and the 
redshift range 2.15 < z < 3.5, following the quasar luminosity function measured in [14] up 
to a limiting magnitude of g = 22. We select only 75 % (independently of g magnitude and 
redshift) of the quasars in order to have a quasar number density closer to the one obtained in 
the BOSS survey (~ 15 — 17deg~ 2 ). The total number of quasars in the sample is N q ~ 5000. 
The code we use to generate the absorption fields with the method described in Section 2 
was able to generate all the absorption spectra in one survey mock with a node with 8 CPU 
in a few hours. 

The redshift distribution of the sources in a real survey usually differs substantially 
from that inferred from the model luminosity function, mainly because the target selection 
efficiency has a strong dependence on redshift. In particular, in the optical color selection 
used by SDSS, quasars at z ~ 2.7 overlap the stellar locus and are confused with stars, making 
them harder to select. There is also a change in efficiency as a function of the foreground 
stellar density and dust absorption. We do not include these effects here. If anything, these 
effects should reduce the errors of measuring the Lya correlation because they should cause 
an increased overlap of the Lya spectra redshift range and an increased number of quasar 
pairs at small separations, for fixed mean quasar density. 



([2]): 




3.2 



(2.11) 
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After having constructed the spectra of the transmitted fraction F as described in the 
previous section, we generate a realistic observed quasar spectrum that includes the spectral 
resolution and noise approximately matching those in the BOSS survey, following these steps: 

• A new set of pixels for a mock of the physical spectrum in units of flux is constructed, 
covering the whole, fixed wavelength range 3600 A < A < 9000 A, with pixels of constant 
wavelength width AA = 1 A. The width of these pixels in comoving separation is 
therefore changing along the spectrum. 

For each quasar, we compute the mean value of the pixel width (R w ) and the mean 
value of the Point Spread Function(i? p ) in comoving separation, over the region that 
is used for measuring the Lya forest correlation function, 1041 A < A < 1185 A in the 
rest frame. We then convolve our spectrum of 5f in the original pixels of constant 
comoving length with the Point Spread Function that results from the convolution of 
a Gaussian spectral resolution and the pixel width in the final wavelength bins: 



S F (x) 



1 

2^ 



dke ikx S F (k) exp 



k 2 R 2 p ~ 




sva.(kR w /2) 


2 




kR w /2 



(3.1) 



The values of R p , R, 



depend on the quasar redshift, with typical values being in the 
range 0.6 — 0.8 h~ l Mpc. We note that the wavelength dependence of the spectral 
resolution and the pixel width in the BOSS spectrograph are actually quite complex, 
and they should be carefully treated if one is interested in small-scale correlations. 

• Each pixel in the spectrum with constant wavelength bins is assigned the value of F in 
the nearest bin of the spectrum with pixels of constant comoving width. We set F = 1 
for wavelengths outside the Lya forest range. 

• We multiply the spectrum of F by the continuum for each quasar, using the mean 
rest-frame spectra obtained in [15]. A spectrum of physical flux, /(A), is obtained after 
normalizing to match the g magnitude of the quasar. 

• The expected noise variance for the case of the BOSS spectrograph with an exposure 
time of 1 hour is computed at each pixel using the expression 



o%(\) = A + B(\) [/(A) + S (A)] AA , (3.2) 

where s(A) is a typical sky flux in BOSS, A is the read-out noise and B(X) is related to 
the BOSS throughput. These functions have been estimated using BOSS survey code 
[16]. 

• We add a Gaussian random variable with variance af^ to the flux /(A) at each pixel, 
and divide the resulting flux by the continuum to obtain a new spectrum of transmitted 
fraction F (which is no longer restricted to the range < F < 1 because of the noise 
that has been added). 

The detailed properties of the noise in the real survey are more complicated, but this 
simple procedure allows us to approximately study the effect of noise on the correlation 
function measurement. 

An example of mock spectra with continuum and noise added is shown in Figure 1. 

We have generated 50 realizations of this mock survey to obtain the results that are 
presented next. 
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Figure 1. Mock quasar spectrum (solid red), without noise (dashed green) and without Lya absorp- 
tion (dashed blue). 



3.2 Measurement of the Correlation Function 

We estimate the value of the correlation function as the weighted average of the product of 
the 5f variable in all pixel pairs that have a redshift space separation r, angle cosine \jl and 
mean redshift z, which are within a certain bin of width Ar, Afi and Az, which we designate 
as A: 

U = ^ • (3-3) 

The weights Wi for each pixel are to be chosen to minimize the error of the correlation 
function estimator, although special care needs to be taken on any possible bias that they may 
introduce. Our first choice for the weights was to set them equal to the total inverse variance 
in each pixel. The total variance is equal to the sum of the intrinsic Lya forest fluctuations, 
a F (z) = (Sp), and the variance caused by the instrumental noise, erjy-(A)/ [-F(z)C(A)] , so 
the weights are 

<4(Ai) 



If; 



cr F {zi) + 



(3.4) 



{F(z t ) C{\)Y 

These are the same weights that were adopted in [8]. The effect of the intrinsic variance 
correlation in neighboring pixels is ignored, as was done in [8], since this is not expected to 
have any effects on large scales. Note that the noise variance in equation 3.2 applies to the 
flux variable /(A) = F[l + <5f(A)] C(X). The corresponding contribution to the variance of 
5f in equation 3.4 is obtained by dividing by [F(z)C(A)] . 

The correlation function is estimated first on a large number of small enough bins for the 
final result to have converged to the correct value, using 150 bins in r up to r = 150 h~ l Mpc, 
20 bins in [i and 20 bins in z, all of them linearly spaced. Then, for the purpose of plotting 
the results, the correlation function is averaged over all redshifts and compressed into broader 
bins of 10Mpc//i in r, and three bins in (i. This average over the small bins into the broader 
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bins is done using the combined weights of each of the small bins, in the same way for the 
theoretical value of the correlation function and its estimate obtained in each mock. The 
results are plotted in the left panel of Figure 2, where the average estimate from the 50 
mocks is shown as the points with errorbars, and the theoretical value from the model power 
spectrum used to generate the mocks is shown as the curves. 
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Figure 2. Correlation function obtained from the average of 50 mocks of our survey model (points), 
as a function of r in three different bins in /i = cos(8), compared to the input model (lines). The 
errorbars are computed for the average of 50 mocks, from the dispersion between them. Left panel: 
Pixel weights from equation 3.4, which introduce a bias owing to their dependence on the measured 
flux. Right panel: Pixel weights from equation 3.5, which are unbiased. 

The errorbars reflect the expected errors for the average of 50 mocks, and have been 
computed from the dispersion between the 50 mocks. Therefore, the true errors of one of our 
realizations of a survey with an area of 300 square degrees are larger by a factor \/50 — 1 = 7. 
This neglects the edge effects of the survey (the fact that fewer pairs of quasars are found for 
quasars near the edge of the survey area), which are small. For reference, the BOSS survey 
is expected to cover an area of about 30 times that of one of our mocks, with about the same 
quasar density, so the errors for the final BOSS results should be about y/5/3 larger than in 
Figure 2. 

As we can see in the left panel of Figure 2, there is a disagreement between the input 
theory and the measured correlation function that is clearly above the errors. This disagree- 
ment is due to the bias of the weights in equation 3.4: the weights are systematically smaller 
for pixels with a smaller value of Sf- This causes the bias in the estimator for the correlation 
function. To remove this bias, equation 3.4 can be modified to the following form, in which 
the weight is set equal to its value for <5p = and does not depend on the measured flux: 

, J 2 , , , A + BjX^AX [F(z t )C(A t ) + g (A i )] |~ 1 

w i = \ °F\ z i) H 7^ 75 } ■ ( 3 - 5 ) 

The results obtained using the weights w[ are shown in the right panel of Figure 2, showing a 
complete absence of any bias. This example illustrates that particular care needs to be taken 
in the future in the choice of weights to obtain unbiased estimates of the correlation function: 
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a better estimator than the one used in [8] will be required for a proper interpretation of the 
measurements with the final BOSS data set. 

According to this prediction, the amplitude of the BAO peak should be detectable at 
the ~ 5 — a level in bins of 10/i -1 Mpc in r in the correlation function if all the data obtained 
in BOSS is as expected, in agreement with previous authors. 

3.3 Variations in the Survey Strategy 

An application of the mock Lya forest surveys is to calculate the precision achieved in the 
measurement of the correlation function on large scales as a function of any survey properties 
in order to optimize the design of the survey. This study may often be done using a Fisher 
matrix approach without the need to generate survey realizations, but using the mocks 
presented here allows one to include any possible systematic effects in a more complete way. 

1.6 



s 1.4 

O 
ZS 

i 1.2 

1 

O 



co 0.8 
.a 

i 

o 

a5 0.6 
0.4 

20 40 60 80 100 120 140 
r (h~ 1 Mpc) 



i i i r 

fiducial - 
half time exposures 
no noise 
A 9max = 218 



Figure 3. Fractional change in the errorbars of the correlation function, for each radial bin, with 
respect to the fiducial survey, when varying survey parameters. The dashed green line assumes that 
only half as many exposures are obtained (i.e., the signal-to-noise is reduced by a factor y/2), and the 
dotted pink line shows the result of eliminating the faintest 16% of the quasars with 21.8 < g < 22. 
The dotted blue line is for the case with no observational noise. 

Here we study the change in the errorbars of the correlation functionwhen we vary either 
the exposure time or the number of observed quasars within a fixed area. 

We note that the variation of these errorbars with the area of the survey, if we keep the 
quasar density fixed, is basically proportional to the inverse square root of the area, apart 
from the presence of edge effects, which are already small at the BAO scale for our fiducial 
survey with an area of 300 deg 2 . 

Figure 3 shows that the fiducial survey has errors that are reduced by ~ 30 % if the 
observational noise (both photon and read-out noise in the detectors) were entirely eliminated. 
In other words, the errors arising from observational noise and from the intrinsic sampling 
variance in the Lya forest are comparable in our fiducial survey. The best strategy to reduce 
the sampling variance is to aim for the largest possible survey area. Increasing the source 
density is more difficult because one has to search for fainter quasars, which are harder to 
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identify and have larger observational noise for a fixed exposure. The curves in Figure 3 
show that reducing the number of exposures by a factor of 2 degrades the error bars by the 
same amount (10 to 15%) as eliminating the faintest 16% of the quasars, in the magnitude 
range 21.8 < g < 22. Therefore, this shows that maximizing the number of quasars that are 
observed is the best survey strategy, even near the magnitude limit of the BOSS spectroscopic 
quasar survey (see [17]), and even if this is done at the cost of some reduction in the exposure 
time. 

A simple Fisher matrix approach was used in [7] to study the best survey strategy to 
measure the angular diameter distance Da(z) and the Hubble parameter H(z) from the BAO 
wiggles in the power spectrum. In their Figure 1, these authors show that when the survey 
limiting magnitude is reduced from g = 22 to g = 21.8, the fractional error on the angular 
distance Da(z) increases by ~ 20% and the Hubble parameter H(z) increases by ~ 10%, in 
agreement with the 10 — 15% increase of the errorbars that we find (the S/N used for their 
figure is higher than in our mocks, so their improvement for a fainter limiting magnitude 
should be slightly higher than ours). In their Figure 5, [7] show that the fractional error 
on both scales increases by ~ 10% if the (S/N) 2 is reduced by a factor of 2 (equivalent 
to reducing the number of exposures by a factor 2), also in agreement with the 10 — 15% 
increment of the errorbars found in the analysis of our mocks. Similar results were obtained 
by [18]. Our results therefore confirm these earlier studies, where we have now included 
various effects in greater detail, such as the redshift evolution, the expected length of the 
observable spectra and their degree of overlap (these are typically included in Fisher matrix 
calculations in a somewhat abstract, averaged way). 

Our method is highly flexible to allow for a rapid computation of the best strategy 
for survey optimization, including any systematic effects that one may consider and include 
in the mocks in a realistic way. The mocks described here were already used to make a 
preliminary study of the systematic effects of continuum fitting errors, spectroscopic noise, 
metal absorption lines and high column density systems for the first measurements with 
BOSS presented in [8]. 

3.4 Comparison with Gaussian field errorbars 

The forecasts for the accuracy of correlation function measurements with the Fisher matrix 
approach in [7] and [18] assumed that the field can be modeled as a Gaussian variable. We can 
test that this hypothesis does not indeed alter the result by generating mocks of a Gaussian 
field and of a field with the lognormal distribution of equation 2.10, with the same power 
spectrum and generated with the same random numbers. 

In Figure 4 we show that the errorbars in the measurement of the angular-averaged 
correlation function of a Gaussian mock survey are nearly identical to the errorbars of a 
lognormal mock survey. This confirms the expected result that the errorbars are not sensitive 
to the 1-point function and are determined solely by the power spectrum (this does not, 
however, test the importance of longer range gravitational non-linearity [19], although we do 
not expect this to be important on large scales either). 

4 Conclusions 

The method described here is able to create mock correlated spectra of the Lya forest surveys 
mimicking the observed properties of real surveys. Two free functions can be introduced as 
input to the mocks, fixing the one-point distribution and two-point correlation function of 
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Figure 4. Comparison of the errorbars in the angular averaged correlation function expected for 
a survey similar to BOSS, compared to mocks where the field has the same power spectrum but a 
Gaussian distribution. 

the field Sf, which can be made to evolve with redshift. The higher order n-point functions 
that are not reproduced are not expected to significantly affect the measurement errors of 
2-point statistics on the large scales of interest. 

This paper presents only a simple example of the application of these mocks to a survey 
with similar characteristics as BOSS. The technique has already been used in the first analysis 
of BOSS data in [8]. In the future, we plan to improve our methodology to apply it to a 
number of sources as large as the entire BOSS survey, and to include all the observational 
effects in increasing detail. For example, one of the main applications of these mocks is to 
accurately model the effect of high column density systems and metal-line absorption systems 
on the measurement of the Lya forest correlation, which will be described in [10]. 

Recently, [20] and [21] have also presented methods for generating mock surveys of Ly- 
man alpha absorption. Both of these studies generate Gaussian fields in a 3D grid for the 
density and the velocity field, and compute the transmitted flux using a lognormal transfor- 
mation. As mentioned in our Section 2, these methods face computational challenges owing 
to the large amount of memory that is required, which limits their study to small volumes 
(area of 79 deg 2 in [20]) or poor resolution (cell size of 3.2 h 1 Mpc/h in [21]). Furthermore, 
they cannot easily produce large numbers of realizations of a survey with the observed range 
of redshifts that are truly independent of each other, because of the difficulty in running 
many independent 3D simulations and storing their outputs. 

An important difficulty of our method is still the memory needed to store the covariance 
matrices of each k-mode, because their size grows with the square of the number of simulated 
lines of sight. This limits the number of lines of sight to ~ 10 4 when running the code in a 
24-CPU node with 32 Gb of shared memory (the CPU time used is ~ 100 hours). We are 
currently working on an improved version of the code that will be able to avoid this limitation 
by using MPI and sparse matrix algorithms. 
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